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In the context of Alzheimer's disease, two challenging issues are ( 1 ) the characterization of local hippocampal 
shape changes specific to disease progression and (2) the identification of mild-cognitive impairment patients 
likely to convert. In the literature, (1) is usually solved first to detect areas potentially related to the disease. 
These areas are then considered as an input to solve (2). As an alternative to this sequential strategy, we inves- 
tigate the use of a classification model using logistic regression to address both issues (1) and (2) simultaneously. 
The classification of the patients therefore does not require any a priori definition of the most representative hip- 
pocampal areas potentially related to the disease, as they are automatically detected. We first quantify deforma- 
tions of patients' hippocampi between two time points using the large deformations by diffeomorphisms 
framework and transport these deformations to a common template. Since the deformations are expected to 
be spatially structured, we perform classification combining logistic loss and spatial regularization techniques, 
which have not been explored so far in this context, as far as we know. The main contribution of this paper is 
the comparison of regularization techniques enforcing the coefficient maps to be spatially smooth (Sobolev), 
piecewise constant (total variation) or sparse (fused LASSO) with standard regularization techniques which do 
not take into account the spatial structure (LASSO, ridge and ElasticNet). On a dataset of 103 patients out of 
ADNI, the techniques using spatial regularizations lead to the best classification rates. They also find coherent 
areas related to the disease progression. 

© 2014 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license 

(http://creativecommons.Org/licenses/by/3.0/). 



1. Introduction 

Large scale population studies aim to improve the understanding of 
the causes of diseases, define biomarkers for early diagnosis, and devel- 
op preventive treatments. An important challenge for medical imaging 
is to analyze the variability in MRI acquisitions of normal control (NC), 
mild cognitive impairment (MCI), and Alzheimer's disease (AD) pa- 
tients. For Alzheimer's disease, several classification strategies have 
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been proposed to separate patients according to their diagnosis. These 
methods can be split into three categories: voxel-based (Fan et al., 
2007, 2008a,b; Kloppel et al., 2008; Lao et al., 2004; Magnin et al., 
2009; Vemuri et al., 2008), cortical-thickness-based (Desikan et al., 
2009; Kloppel et al., 2008; Querbes et al., 2009) and hippocampus- 
based (Chupin et al., 2007, 2009; Gerardin et al., 2009) methods. 
While decent classification rates can be achieved to separate AD from 
NC or NC from p-MCl (progressive MCI patients, i.e. converting to AD), 
all methods perform poorly at separating s-MCI (stable MCI patients, 
i.e. non-converting to AD) and p-MCI. A recent review comparing 
these methods can be found in Cuingnet et al. (201 1 ). 

In the case of longitudinal analysis, it is not anymore the shapes that 
are compared but their evolutions in time. To extract information be- 
tween two successive time-points, we use a one-to-one deformation 
which maps the first image onto the second one. Different registration 
algorithms are available to compute plausible deformations in this con- 
text. However, only one, the large deformations via diffeomorphisms 
(LDDMM) (Beg et al., 2005), provides a Riemannian setting that enables 
to represent the deformations using tangent vectors: initial velocity 
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fields or equivalently initial momenta. This can be used in practice to re- 
trieve local information and to perform statistics on it as presented in 
Vaillant et al. (2004) and Wang et al. (2007). In this direction, it is 
worth mentioning the study of Singh et al. ( 201 0) which shows the cor- 
relation between principal modes of deformation and diagnosis. In Qiu 
et al. (2008) the authors estimate the typical deformation of several 
clinical groups from the deformations between baseline and follow-up 
hippocampus surfaces. In order to compare this information across the 
population, we need to define a common coordinate system. This im- 
plies (1) the definition of a template and (2) a methodology for the 
transport of the tangent vector information. Note finally that, as far as 
the authors know, no paper explores binary classification using logistic 
regression in this context. 

Quality of shape descriptors with regard to the disease is often eval- 
uated through statistical significance tests or classification performance. 
In this paper, we evaluate descriptors on a binary classification task 
using logistic regression. 

In addition to its simplicity, it has the advantage of providing a map 
of coefficients weighting the relevance of each voxel. Such map can be 
used to localize the hippocampus deformations that are related to AD. 
However, the dimensionality of the problem (i.e. number of voxels p) 
being much higher than the number of observations (i.e. number of pa- 
tients n, p ~ 10 s » n ~ 10 2 ), the problem requires proper regularization. 

Now standard regularization methods such as ridge (Hoerl and 
Kennard, 1970), LASSO (Tibshirani, 1994) and Elastic Net (Zou and 
Hastie, 2005) do not take into account any spatial structure of the 
coefficients. 

In contrast, spatial models for regularizing supervised learning 
methods have been proposed in the literature (Grosenick et al., 2013; 
Jenatton et al., 2012; Ng and Abugharbieh, 2011). Total variation was 
used to regularize a logistic regression on functional MRI (flvIRI) data 
(Michel et al., 201 1 ). This method promotes coefficient maps with spa- 
tially homogeneous clusters. Fused LASSO was also used on fMRI data 
(Baldassarre et al., 2012; Gramfort et al., 2013). Similar ideas can be 
found in Cuingnet et al. (2012) where the authors defined the notion 
of spatial proximity to regularize a linear SVM classifier. 

In Durrleman et al. (2013), the authors introduce sparse parametri- 
zation of the diffeomorphisms in the LDDMM framework. Our goal is 
different: we want spatial properties (smoothness, sparsity, etc.) to be 
found across the population (i.e. on the common template) and we 
want this coherence to be driven by the disease progression. 

In this paper, we investigate the use of total variation, Sobolev and 
fused LASSO regularizations in 3D volumes. Compared to total variation, 
Sobolev enforces smoothness of the coefficient map, whereas fused 
LASSO adds a sparsity constraint. 

The deformation model used to assess longitudinal evolutions in the 
population is presented in Section 2. Machine learning strategies are 
discussed and the model of classification with logistic loss and spatial 
regularization is described in Section 3. The dataset used and numerical 
results are presented in Section 4. We illustrate that initial momenta 
capture information related to AD progression, and that spatial 
regularizations significantly increase classification performance. 
Section 5 concludes the paper. 
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2. Longitudinal deformation model for population analysis 

2.1. Global pipeline 

Let us assume that we have a population of patients and the binary 
segmentation of their hippocampus at two different time points, called 
screening and follow-up. Let us also assume that all patients have the 
same diagnosis at the screening time point, and only a part of them 
have converted to another diagnosis at the follow-up time point. Our 
goal is to compare patient evolutions, and classify them with regard to 
disease progression, i.e. stable diagnosis versus progressive diagnosis. 
From a machine learning point of view, we need to build features 
encoding the evolutions of the patients. 

We use the pipeline summarized in Fig. 1. First, the evolution de- 
scriptors are computed locally for each patient (independently). To be 
able to compare these descriptors, one needs to transport them into a 
common space. To do so, a population template is computed, towards 
which all the local descriptors are transported. Finally, classification is 
performed to separate progressive from stable patients. 

2.2. Diffeomorphic registration via geodesic shooting 

As mentioned in Sections 1 and 2.1, local deformation descriptors 
are computed to model the evolutions of the patients. In this section, 
we describe how we use diffeomorphic registration via geodesic shoot- 
ing Vialard et al. (2012a) to compute these local deformation 
descriptors. 

2.2.1. Definitions 

To register a source image I:flci 3 ->1 towards a target image J : 
O c R 3 — ► R, the LDDMM framework (Beg et al., 2005) introduces the 
following minimization problem 

argmin fa}-}^ +X f \\v t \? K dt, (1) 

ueL 2 ([0,l],W K ) 1 JO 

where v : (t,<w) e [0,1] x Q c R 3 — > Q is a time dependent velocity field 
that belongs to a reproducing kernel Hilbert space H K of smooth enough 
vector fields defined on II, and of associated kernel K and norm II 11^, 
and \ > 0 is a regularization coefficient. For (t,a>) e [0,1] x D, we note 
u t (co) = v(t, co). The deformation 4> ■ [0,1 ] 2 x Q c R 3 — ► Q is given by 
the flow of v t 

{d<j> 0[ 
- SF (co) = v c o4 >0[ (co) {2) 
*i,t(w) = w > 

where 4> n a is the deformation from t = t] to t = t 2 . Such approach in- 
duces a Riemannian metric on the orbit of /, i.e. the set of all deformed 
images by the registration algorithm (Miller et al., 2006). The first 
term in formula (1) is a similarity term controlling the matching 
quality whereas the second one is a smoothing term controlling 
the deformation regularity. Now noting l t A = lo <£~' and ] t A = J o $ tl , 



Screening and 
follow-up images 

<?.<n 



Local 
deformation 
descriptors {Pq} 
(see Section 2.2) 



Population 
template T (see 
Section 2.3) 



Transported 
deformation^ 
descriptors {Pq} 
(see Section 2.4) 



Classification 
(computation of 

a function 
/ : X -> y , see 

Section 3) 



Fig. 1. Four steps are needed to classify patient evolutions using local descriptors of shape deformations: (1 ) the local descriptors are computed for each patient independently, (2) a pop- 
ulation template is computed, (3) all local shape deformation descriptors are transported towards this template, and (4) classification is performed. 
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the Euler-Lagrange equation associated with Eq. (1) reads V (t,a>) < 
[0,1] x a. 



v t (a) = -JO (gradJ t (ffl)Jac^(ffl)(/ t (w)-J t (w))) 



(3) 



where K is the translation-invariant kernel of the reproducing kernel 
Hilbert space, * the convolution operator, grad the image gradient in 
space and Jac^ the Jacobian of 4>. 

For t e [0,1 ], let us define the momentum P t : Q — > M by 



Vffl ee fl, P t (<o) d l* ja C<( , t] (a)(I t (m)-J t (w)). 



(4) 



The Euler-Lagrange Eq. (3) can be rewritten as a set of geodesic 
shooting equations 



^(£o) + (grad J '(to), u t (to) > = 0, 

VM)eI °' 1]XA ^(») + d*r(P t (^(»,) = 0, 
I u t (t») + K * grad I t {a>) P t (a>) = 0, 



(5) 



where div is the divergence operator. 

Given an initial image 7 0 and an initial momentum P 0 , one can inte- 
grate the system ( Eq. ( 5 ) ) . Such a resolution is called geodesic shooting. 
We say that we shoot from I 0 using P 0 . 

The minimization problem (Eq. (1)) can be reformulated using a 
shooting formulation on the initial momentum P 0 

argmin 1 11/ o fa] -J|£ + \(grad I 0 P 0 , K * grad / 0 P 0 ) t 2 (6) 
p d 1 

subject to the shooting system (Eq. (5)). 

In order to solve the new optimization problem (Eq. (6)), we use the 
methodology described in Risser et al. (201 1 ) and Vialard et al. (2012a). 
Note that this methodology is similar to the one presented in Ashburner 
and Friston (201 1 ), but uses a different optimization strategy. 

For each patient, a two-step process was performed to encode the 
deformations of the hippocampus shape evolution from the screening 
image S (scanned at t = t 0 ) to the follow-up image F (scanned at 
t = t 0 + 12 months ) , as described in Fig. 2. First F was rigidly registered 
back to S. We note R:fic| 3 ->fl the rigid transformation obtained. 
Second, the geodesic shooting was performed with the screening 
image as source image (/ = S) point towards the registered followed- 



up image as target image (J = F ° Initial momenta from different 
patients are local descriptors that were used to compare hippocampus 
evolutions, such choice is further described in the next paragraph. 

2.2.2. Motivation and rationales for the use of initial momenta 

As written in the third row of Eq. (5), the velocity field v encoding 
the geodesic between the registered images has the following property 
at each time re [0,1 ] and at each coordinate co e Q, 



v t (m) = — /<*grad/ t (&>)P t (ft>). 



(7) 



We recall that l t , v t and P t are respectively the deformed source 
image, the velocity field and the momentum at time t. We also denote 
K* the convolution with the kernel K (typically Gaussian). Therefore, 
Eq. (7) can be read in the case of a binary image as follows: the unitary 
vector field normal to the shape surface is multiplied by a scalar field 
P(t) and this quantity gives the vector field v t once convolved with the 
kernel K. 

The system given in all rows of Eq. ( 5 ) leads to the fact that the initial 
momentum P 0 entirely controls the deformation for a given source 
image / 0 and a given kernel K. In the context of our study, longitudinal 
variations of the geodesies are relatively limited as only small deforma- 
tions are required to register pairs of hippocampi out of the same sub- 
ject. The displacement field can then be reasonably approximated by 
Id + Wo using a first-order expansion of Eq. (5). As a consequence, P 0 
can be directly interpreted as a value encoding expansions and contrac- 
tions of the shape if multiplied by — grad / 0 and then smoothed by K. 
Note also that the momentum is a scalar field, which is a more compact 
representation than a vector field. This motivates our approach. 

2.3. Population template 

2.3.1. Need for a template 

As mentioned in Section 2.1, local descriptors of hippocampus evolu- 
tions need to be transported in a common space prior to any statistical 
analysis. One way to obtain spatial correspondences between local de- 
scriptors of different patients consists in building a population template 
and then aligning these descriptors on the template. In the literature, 
template algorithms can be categorized into deterministic (Avants and 
Gee, 2004; Beg and Khan, 2006; Fletcher et al., 2004; Pennec, 2006; 
Vialard et al., 2011), probabilistic (Allassonniere et al., 2008; Ma et al., 
2008) and mixed (Bhatia et al., 2004; Jia et al., 2010; Joshi et al., 2004; 
Seghers et al., 2004) approaches. As described in Section 4.1, we want 




a) Step 1 : rigid registration. 




b) Step 2: geodesic shooting. 

Fig. 2. For each patient, the initial momentum encoding the hippocampus evolution is computed in a two-step process. 
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to build a population of binary images of hippocampi. As there is no 
variation of topology, and we want a template with sharp boundaries 
without averaging the gray levels, the first category is appropriate. 
Most algorithms in this category rely on the notions of Frechet and 
Karcher means, which we will now describe. 

2.3.2. Notions of Frechet and Karcher means 

In the Riemannian framework used for the geodesic shooting, a 
Frechet mean (Frechet, 1948) can be used to define an average shape 
from a population (Fletcher et al., 2004; Pennec, 1999, 2006). Given n 
images (S':QcR 3 -> R}-i < < „ and d a Riemannian metric on the 
space of images, the Frechet mean t : flcR 3 — > R is defined as a minimiz- 
er of the sum of the geodesic distances to all images 

In practice, such problem is often solved via an optimization proce- 
dure looking for a local minimum, and the solutions found are called 
Karcher means. For instance, a solution of Eq. (8) can be computed 
using a gradient descent procedure (Vialard et al., 2011). 

2.3.3. Invariance to rigid orientations, approximations and optimization 
procedure 

The problem (8) is not invariant with respect to the rigid orienta- 
tions of the input images, we modify the optimization problem to 

r& lp{ T ' si °( Ri ry> ( g > 

where {R l : Q — > Q} x < < n are rigid transformations. In this paper, we 
assume that the solution of Eq. (9) can be approximated by alternate 
minimization. It is also important to note that in the general case 
there is not necessarily unicity of the solution. 

When the [R'} are fixed, we follow the optimization strategy de- 
scribed in Vialard et al. (2011). Since the functional in Eq. (1) does not 
give a geodesic distance between two images — but between a source 
image and the deformed image, we approximate the minimization 
with regard to T by 

T ni d ( TJ tf> (10) 

where/! is the result of the shooting equations for the initial conditions 
/ = Tand P 0 = Po, where P\, is a minimizer of Eq. (6) with J = S' ° (J? 1 ) 1 . 
In this case, each term of the sum in Eq. (10) is equal to <grad/P 0 ,/<* 
(grad/P 0 )> t 2, and the gradient with regard to Tis 

-l^JOgradrpj,, (11) 

i=i 

where P 0 is the initial momentum matching T on S l ° (R') _1 via the 
shooting system (Eq. (5)). 

When lis fixed, we approximate the optimization over [R'}^ < ; < „ by 
performing rigid registrations matching each S' to T. 

Altogether, each update of the Karcher estimate is composed of four 
steps 

1. the images S' are rigidly aligned towards the current Karcher mean 
estimate T k , 

2. diffeomorphic registrations via geodesic shootings from the current 
Karcher estimate T k towards all the registered images S' ° (i? 1 ) -1 
are computed, 

3. geodesic shooting from T k using?™ eani = . p Q generates a deforma- 
tion field u mean , 



4. the composed deformation field u k+ i = u mea „ o u k is used to com- 
pute the updated estimate from the reference image. 

The advantage of computing the new estimate from a reference 
image is to avoid consecutive resamplings that would lead to smoothing 
and bias, as noted in Yushkevich et al. (2010). 

In the literature, the empirical convergence of the gradient descent 
procedure optimizing over T (with {Rj}i < * < n fixed) was studied in 
Vialard et al. (2011, 2012b). Similar tests are performed in Section 4.2 
for our procedure. 

2.4. Tangent information and associated transport 

2.4.1. Motivation and rationals 

The local descriptors computed for each patient as explained in 
Section 2.2 need to be transported in a common coordinate space: the 
space of the Karcher average defined in Section 2.3. 

There is still no consensus about the choice of which transport 
method should be used in our context. Different methods have been 
proposed. The first one is the transport of vector fields by the standard 
adjoint map. It was however shown that this method is not quite appro- 
priate for statistical study (Bossa et al., 2010). Parallel transport was also 
proposed in the context of LDDMM (Younes, 2007). Although it might 
seem relevant in our context, volume variation may be distorted. Note 
that its properties also depend on the deformation path and not only 
on the final deformation. 

In the context of LDDMM, another action of the group of deforma- 
tions on the momentum is called co-adjoint transport (Fiot et al., 
2012). This method only depends on the final deformation and pre- 
serves volume variation in the context of small deformations on binary 
images. This argument motivated its use in our study. 

2.4.2. Definitions 

A two-step process was then used to transport local descriptors of 
hippocampus evolutions to the template space (Fig. 3). First, the screen- 
ing hippocampus S' was registered towards the template T rigidly 
(Ourselin et al., 2001) then non-rigidly (Modat et al., 2010). The 
resulting deformation is denoted by <f>'. Second, this transformation 
was used to transport the local descriptors of hippocampus deforma- 
tions towards the template. 

We use the standard transport for a density Po : O c R 3 — » R, defined 

by 

Vaefl, P> 0 ((o)= ■det^jac^_ 1 (o))Jp' 0 o (<*>') _1 (o>), (12) 

where det is the notation for the determinant. Note that this action pre- 
serves the global integration of the density by a simple change of 
variable. 

3. Machine learning strategies 

3.3. Support vector machine classification 

In Fiot et al. (2012), SVM classifiers are used on different types of 
features. In that paper, local features obtained by integration of initial 
momenta on subregions provided the best classification results. 
This conclusion motivates the search for an optimal subregion fl r 
defining features as Xj d ='J" P 0 ((o)d(t) (optimal in terms of classification 
accuracy). This is equivalent to the search of the best indicator function 
l r : Q — > {0,1 }, or more generally a weighting function w : Q — > R defin- 
ing features by Xj d = j Q w(a>)P 0 ((o)d(o. 

To compute meaningful weighting functions, models where the fea- 
ture space is the same as the input space are of particular interest. Indeed 
as one coefficient corresponds to one voxel, meaningful spatial 
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Patient space Template space 

Step 1, Registration 



Screening image S 1 



Local deformation 
descriptor P 0 ' 



Deformation <l> 1 
\ 

Step 2. Transport 



Population template T 



Transported deformation 
descriptor P 0 ' 



Fig. 3. Local descriptors of hippocampus evolutions are transported to the template in a two-step process. First the deformation field from the patient space to the population template. 
Second, this deformation field is used to transport the local descriptors. 



regularizations can be introduced. This was used in the linear SVM 
setting in Cuingnet et al. (2012). In this paper, we exploit similar ideas 
on a classification framework with a logistic loss, which is well-suited 
for the introduction of spatial regularizations, easy to implement and 
that can be solved efficiently. 

3.2. Binary classification with logistic regression and spatial regularization 

32 A. Definitions 

Let us define a predictive model which reads 



y d = F(Xw+f>), 



(13) 



The standard elastic net regularization (Zou and Hastie, 2005) 
uses a combined / , and squared / 2 penalization A£N(w) = def Ai 1 1 w| 1 1 + 
A 2 ||w||| = ZIj=i^-i|Wj| + \ 2 wj , with the limit cases \ 2 = 0 referred 
to as LASSO (Tibshirani, 1994) and \^ = 0 referred to as ridge (Hoerl 
and Kennard, 1970). However as mentioned in Michel et al. (2011), one 
drawback of such methods is that they do not take into account any 
geometrical structure of w. Since coefficients are expected to be locally 
correlated in space, we investigate the Sobolev semi-norm, total variation 
semi-norm and fused-LASSO regularizations, respectively defined as 



SB(w) d = J2 llgrad Q w(eo) II 



(17) 



where y e {± 1 }" is the behavioral variable, X e R n x p is the design ma- 
trix containing n observations of dimension p, F is the prediction func- 
tion and (w,b) e E p x K are the parameters to estimate. In our 
application, each coefficient in y represents the disease progression of 
one of the n patients, and each row in X contains the initial momentum 
representing the deformations of the hippocampus of one of the n pa- 
tients. It is important to notice that each row in X is noted as a vector 
in M p in the formulation of the predictive model, but it is actually a scalar 
field in 3D. Similarly, w is noted as a vector in R p for the convenience of 
the formulation of the model, even if it also represents a scalar field in 
3D. Since each coefficient in w is associated to a spatial position, w is 
sometimes called a coefficient map. Such property allows us to detect 
(spatial) areas of interest, with regard to the machine learning problem 
we want to solve (see Section 3.2.4 about the interpretation of the solu- 
tion of the model). 

The logistic regression model defines the probability of observing y, 
given the data x, as 



p(y,|x i ,w ! b) d = 



1 + exp( 



1 

-y f (x[w + b))' 



(14) 



Given parameters ^w, b) and a new data point x the prediction is 
the maximum likelihood, i.e. class(x) = argmaXj, e (±1} p(y|x, w, = 
sign(x T w + bj . Accordingly the parameters are estimated as mini- 
mizers of the opposite log likelihood of the observations, considered 
as independent 



£(w,b) d 4 f ■l£log(l+exp( 

(=1 



-y, 



")))■ 



(15) 



Since the number of observations is much smaller than the dimen- 
sion of the problem (n <k p) minimizing directly the loss Eq. (15) 
leads to overfitting, and proper regularization is required. This is com- 
monly performed by introducing a regularization function J and the 
final problem becomes 



Find ( w, b ) in argmin C ( w. b) + \J ( w) . 

\ ' w.b 



(16) 



where A. is a coefficient tuning the balance between loss and 
regularization. 



TV(w) = Hgrad a w(w)ll 2 , 

oj-O 



AFL(w) = \ 1 TV(w) + ^llwll] 



(18) 



(19) 



The above sums go over all voxels co in the domain Q c R 3 , and 
gradj! is a linear operator implementing the image gradient by finite dif- 
ferences. By indexing each voxel co by integer coordinates on a 3D lat- 
tice, we define gradn by 



grad fl w (a ijk ) = 





1 A Q W 


>° 


clef. 










CO 



>a+w 



J i{i+\)k 



(20) 



whereA fl w(co 1 , C o 2 )= def i"'( £0 2)-w(c«) 1 ) i{{a„co 2 ) etf ^ tf fi _ 

° v 1 ■ 2! \0 otherwise, 
nition allows to restrain fl to any region of interest and boundaries of 
the domain are not penalized. Rationals and differences for those 
regularizations are discussed in Section 4. 

3.2.2. Solving the model 

Let us first study differentiability and convexity of the objective func- 
tion in Eq. (15). For convenience, we define w d = (w T , b) and for all i, 

f ' vT 1 ^ T , with associated data matrix X def 



xf= (xM) 
Then Eq. (15) becomes 



( X i)')l<i<n ^ 

1 <j<p + 1 



jnx(p+l) 



£ ( w ) =^S 1 °g( 1 + ex P(-3'iXiWjJ. 



(21) 



This loss function is twice differentiable and the non-negativity 
of V 2 £(w) establishes the convexity. 

When the regularization J is also convex and twice differentiable the 
reference optimization algorithms include quasi-Newton methods; in 
particular for large-scale problems the limited memory Broyden- 
Fletcher-Goldfarb-Shanno (LM-BFGS) is veiy popular. However non- 
differentiable regularizations such as total variation and fused LASSO 
optimization raises theoretical difficulties. Proximal methods such as 
monotonous fast iterative shrinkage thresholding algorithm (M-FISTA, 
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(Beck and Teboulle, 2009)) and generalized forward-backward (GFB, 
(Raguet et al., 2013)) have been considered. Unfortunately their low 
convergence rates are prohibitive for extensive investigation of the clas- 
sification scheme (parameter \, domain n, training design matrix X). 
Therefore we used the hybrid algorithm for non-smooth optimization 
(HANSO, (Lewis and Overton, 2012)) which is a LM-BFGS algorithm 
with weak Wolfe conditions line search. This addresses both the total 
variation semi-norm and the /ynorm, with almost everywhere 

VTV(w) = -div((||grad Q w(m)||^Vad fl w((a)) me J, 
V||w||, = (sign(w(<o))) Mea . 

3.23. Weighted loss function 

In supervised learning, classifiers trained with observations not 
equally distributed between classes can be biased in favor of the major- 
ity class. In order to alleviate this, several strategies can be used. One 
strategy is to restrict the training set to be equally distributed among 
classes. An alternative strategy is to use the full training set and intro- 
duce weights (q,), e [[ lin j] in the loss function as follows 

^q,log(l + exp(-y i x T w)) (22) 

where q, = def n/ (n c x cardjje [1 ..n]|yj = yj}),n c being the number of 
classes (2 in our case). When the observations are equally distributed 
among classes q, = 1 for all i and one retrieves (Eq. (21)), whereas 
q, < 1 (respectively q, > 1) when the class of observation i is over- 
represented (respectively under-represented) in the training set. 

32.4. Interpretation of the solution 

Another motivation for the use of the model presented in Section 3 is 
the possibility to interpret the computed solution. Let us remind that, 
after optimization, the solution is of the form (w, bj^MP x R. This 
solution can be used to predict the evolution ye {±1} of a new patient 
of associated initial momentum x e R p , by using the equation y = sign 
(x T w + bj. As mentioned in Section 3.2.1, the hyperplane w has the 
same dimension of the initial momentum, and each coefficient is associ- 
ated to one voxel. 

Now let us talk about the interpretation of the weights in w. High co- 
efficients in w correspond to areas of the hippocampus where the defor- 
mation is related to the disease progression. They are not areas of high 
expansions or contractions, and therefore have a different interpreta- 
tion than the coefficients in the initial momenta (see Section 2.2 for 
the interpretation of the coefficients of the initial momenta). On the 
contrary, coefficients close to zero in w represent areas where the values 
of x are not relevant to the disease progression ( in that case the values of 
x in these areas will not modify the value of the scalar product x T w). In 
that sense, the coefficients in w have a clinical interpretation. 

To summarize, each initial momentum can describe the local 
hippocampal shape changes for a patient taken individually, whereas 
the coefficient map w can describe the relevance of hippocampal 
areas with regard to the disease progression, at the population level 
i.e. from the observation of all training patients. 

4. Material and results 

4.1. Data 

Data used in the preparation of this article were obtained from the 
Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http:// 
adni.loni.usc.edu ) . The ADNI was launched in 2003 by the National Insti- 
tute on Aging, the National Instituteof Biomedical Imaging and Bioengi- 
neering, the Food and Drug Administration, private pharmaceutical 
companies and non-profit organizations, as a $60 million, 5-year public 



private partnership. The primary goal of ADNI has been to test whether 
serial MRI, positron emission tomography, other biological markers, and 
clinical and neuropsychological assessment can be combined to mea- 
sure the progression of MCI and early AD. Determination of sensitive 
and specific markers of veiy early AD progression is intended to aid re- 
searchers and clinicians to develop new treatments and monitor their 
effectiveness, as well as lessen the time and cost of clinical trials. 

The Principal Investigator of this initiative is Michael W. Weiner, MD, 
VA Medical Center and University of California — San Francisco. ADNI is 
the result of efforts of many co-investigators from a broad range of 
academic institutions and private corporations, and subjects have 
been recruited from over 50 sites across the U.S. and Canada. The initial 
goal of ADNI was to recruit 800 subjects but ADNI has been followed by 
ADNI-GO and ADNI-2. To date these three protocols have recruited over 
1500 adults, ages 55 to 90, to participate in the research, consisting of 
cognitively normal older individuals, people with early or late MCI, 
and people with early AD. The follow-up duration of each group is spec- 
ified in the protocols for ADNI-1, ADNI-2 and ADNI-GO. Subjects origi- 
nally recruited for ADNI-1 and ADNI-GO had the option to be followed 
in ADNI-2. For up-to-date information, see http://www.adni-info.org. 

A dataset of 206 hippocampus binary segmentations from 103 
patients enrolled in ADNI (Mueller et al., 2005) has been used. The seg- 
mentations were computed and provided by ADNI, detailed information 
can be found on their website. For each patient, 'screening' and 'month 
12' were the two time points selected. All patients were MCI at the 
screening point, 19 converted to AD by month 12, and the remaining 
84 stayed MCI. 

42. Experiments 

42 A. Preprocessing 

First, all screening images were resampled to a common isotropic 
voxel size 1.0 x 1.0 x 1.0 mm, similar to their original size. Rigid trans- 
formations aligning the month 12 hippocampus towards the screening 
ones were computed using Ourselin et al. (2001). 

422. Computation of initial momenta 

The geodesic shootings (Vialard et al., 2012a) were performed 3 
using a sum of three kernels (sizes 1, 3 and 6 mm, with respective 
weights 2, 1 and 1 ), and 200 gradient descent iterations. To check the 
quality of the geodesic shooting computed for each patient i (second 
step in Fig. 2), the evolution of the Dice score DSC between S\ which is 
the deformed screening image at time t and the target image F 1 ° (i?) -1 
was computed, and the average final DSC is 0.94 ± 0.01. 

42.3. Computation of the template 

The computation of a Karcher mean as described in Section 2.3 is a 
computationally expensive step, which is linear with the number of im- 
ages. Therefore it can be desirable to select only a subset of the images. 
In this paper, a subset of 20 images was used, of corresponding 
hippocampal volumes which were the closest to the mean hippocampal 
volume. The Karcher mean estimate was updated four times, with 
respectively 200, 150, 150 and 100 gradient descent iterations in the 
geodesic shootings. Below are two verifications we performed to 
validate this approach. 

First, we evaluated if all patients can be registered properly to the 
template, which is an important verification since only a subset of the 
images was used to compute the template. In our study, the average 
Dice score between the 103 registered patients and the template was 
0.87 ± 0.02, which validated the suitability of the template obtained 
for our study. The last paragraph of Section 4.3 also provides another 
reason why such template can be used in our study. 



3 http: //sourceforge.net/projects/utilzreg/. 
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Fig. 4. Empirical measures of convergence of the Karcher template algorithm. On this dataset, we notice that ( 1 ) the convergence speeds are coherent with the ones presented in Vialard 
et al. (2011) and Vialard et al. (2012b), i.e. only a few Karcher iterations are required for convergence, and (2) the alternate minimization overland {R,}i < , < „ provides a faster conver- 
gence than the one over 7 with the {RJ fixed. 



Second, we evaluated the empirical convergence of our optimization 
procedure. Fig. 4a shows the relative distance to the final estimate, i.e. 



image and the template (i.e. DSC(S ° (R') \T)) and between the final 
registered screening image and the template (i.e. DSC(S ° (dV)~\T)). 
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where T k is the Karcher estimate at iteration k, and T„ is approximated 
by the last computed estimate. Fig. 4b shows the relative distance 
between two consecutive estimates, i.e. 



iir 



k+l" 



(24) 



with the same notations. On this dataset, we notice that (1 ) the conver- 
gence speeds are coherent with the ones presented in Vialard et al. 
(201 1, 2012b), i.e. only a few Karcher iterations are required for conver- 
gence, and (2) the alternate minimization over land [R^ < f < „ provides 
a faster convergence than the one over T with the {R,} fixed. 

4.2.4. Transport of initial momenta 

To compute the transformations dV from the screening hippocampi 
towards the template (Fig. 3), rigid (Ourselin et al., 2001) then non- 
rigid (Modat et al., 2010) registration algorithms were applied with 
their default parameters. To check the quality of the registration dV com- 
puted to transport the local descriptor of the patient i (first step in 3), 
the Dice score was computed between the rigidly registered screening 



4.2.5. Computation of the region of interest O s 

The region of interest Il s was restricted around the surface of the 
template (see Fig. 5), where the high values of the initial momenta lie. 
Moreover, this allows greater differences of coefficient values from 
one side to the other when using Sobolev regularization. 

More specifically, given a binary template T : Q c M 3 — > [0,1] and a 
spherical structural element £ r of radius re K defined as 

£ r d = ■{((»,, <o 2 ,ffl 3 )eR 3 ; oi] + o)l + o>3<r 2 }, (25) 
we define the region of interest fi s as 



Q s = Dila(T,£ r ) - Ero(T,£ r ) 



(26) 



where Dila and Ero are the standard dilatation and erosion morpholog- 
ical operators. In this study, using r = 5, the ROI fl s contained 
12,531 voxels. 

4.2.6. Optimization of the logistic regression model 

In the training procedure, we have n = 103 observations (one 
for each patient). As initial momenta are scalar fields in space, each 
initial momenta has the same dimension as the number of voxels, so 
p = 12,531. Since stable and progressive classes in the dataset are 




a) Template T b) Region of interest Q. s 



Fig. 5. The region of interest !! s (visualized with transparency) is designed to select voxels close to the boundary (i.e. close to the surface) of 7. It is obtained via standard morphological 
operations, and in this study fi s contains 12,531 voxels. 
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unbalanced, the weighted version of the loss function defined in 
Section 3.2.3 was used. Solution of the optimization problems was com- 
puted via HANSO 4 with a maximum of 20 iterations. 

4.2.7. Performance evaluation 

First, the effect of spatial regularizations was compared. The spatial 
regularizations introduced in Section 3.2 aim at enforcing local correla- 
tions between the coefficients in w. Using the whole dataset, the effects 
of the various regularizations were compared. Second, the model was 
evaluated in terms of classification of AD progression. All patients 
were classified using a leave-10%-out scheme. From the numbers of 
true/false positives/negatives (TP, FP, TN, FN), four indicators were 
used to measure classification accuracy: specificity Spec d = jJ^ Fp . 
sensitivity Sens d = „„ rP „„ negative predictive value NPV A = ' ™ r .,„ and 

J TP + FN ° drf JL TN + FN 

positive predictive value PPV — ^. Statistical tests were also per- 
formed to evaluate the significance of the differences. Using N = 50 
random re-orderings of the patients, the Spec + Sens variable was com- 
puted 50 times for each regularization and two-sample t-tests were 
performed. 

4.3. Effect of spatial regularizations 

When using standard regularizations, increasing the regularization 
does not lead to any spatial coherence (Fig. 6a, b and c). It is interesting 
to remark that LASSO regularization emphasizes a more limited number 
of points than ridge regularization. This is particularly clear in the right 
columns of Fig. 6, where the regularization energy (\J(w) in Eq. (16)) 
has a significant weight in the total energy. As expected, ElasticNet also 
gives results which are in-between those of LASSO and those of ridge. 
In contrast to these regularization techniques, the higher the spatial 
regularizations, the more structured are the coefficients. Note that 
delimited areas are coherent across different spatial regularizations. 
Sobolev regularization leads to smooth coefficient maps (Fig. 6d) where- 
as total variation tends to piecewise constant maps (Fig. 6e). Finally, 
fused LASSO adds sparsity by zeroing out the lowest coefficients (Fig. 6f). 

4.3.3. Another benefit of spatial regularizations 

As mentioned in the Introduction, a motivation to regularize the 
learning problem is the low number of observations compared to the di- 
mensionality of the problem. However, we can infer another benefit of 
the use of spatial regularizations. Indeed, to build voxel-based statistical 
models from the observations of several patients, one needs to align 
these observations properly. Even though we checked the quality of 
the alignment to the template, such alignment is not perfect. Adding 
spatial regularizations in the model is a way to limit the effects of the 
alignment errors. 

4.4. Classification of Alzheimer's disease progression 

Besides providing a map of coefficients indicating the importance of 
each voxel with regard to the disease progression, the model presented 
in this paper can be used to classify the disease progression of new pa- 
tients. Table 1 displays the classification performance indicators of bina- 
ry classification using logistic loss and various regularizations. 

Without any regularization, the resulting classifier always predicts 
the same class. Before going any further, let us comment on this point. 
If all testing subjects are classified in the same class, it means that all 
the testing points are on the same side on the hyperplane found in the 
optimization process. Here, unbalanced observations and the chosen 
optimization strategy are the causes of this result. In the model used, 
the bias b plays a special role and several strategies can be considered, 
such as 1 ) optimizing w and b at the same time, 2) optimizing w and 



4 http://www.cs.nyu.edu/overton/software/hanso. 



b, then freezing w and optimizing b, 3) optimizing w and b, then freez- 
ing w and setting b using heuristic rules (e.g. setting it to have the same 
ratio between classes in training and test sets), 4) optimizing w with b 
frozen to zero, then optimizing b, 5) optimizing w with b frozen to 
zero, then setting b using heuristic rules, etc. In initial tests, we realized 
that some strategies would classify all patients to positive whereas 
other would classify them all to negative. This happened when the op- 
timization is not regularized. However, this instability with regard to 
the optimization strategy fades out when the problem is regularized. 
These initial tests further motivated the use of regularization. Let us 
note that the above strategy 1 ) was used in all the results presented in 
this paper. 

All regularizations improve significantly the classification perfor- 
mance, the top 3 being the three spatial regularizations. On this dataset, 
fused LASSO is the one providing the best results (Spec + Sens = 1 .32), 
closely followed by total variation (Spec + Sens = 1.31). 

4.4.1. Comparison with the literature 

Using spatial regularizations such as total variation and fused- 
LASSO, our experiments provide higher performances than the best 
one reported in Fiot et al. (2012) (Spec + Sens = 1.27). Moreover, 
the linear classification model used in this paper is simpler than the 
non-linear SVM used in Fiot et al. (2012). SVM is a very powerful ap- 
proach, which has been widely studied and successfully used. Many 
implementations are available, but it can get difficult to modify them 
and, for example, add spatial regularizations. Besides, only linear SVM 
can provide an interpretable map of coefficients, but not the non- 
linear version used in Fiot et al. (2012). On the other hand, a model as 
simple as the logistic regression can be easily implemented and 
modified. 

4.5. Statistical tests 

To evaluate the significance of the performance differences found in 
Table 1, we performed two-sample t-tests. The variable considered was 
Spec + Sens, and 50 realizations of the variable from random re- 
ordering of the patients were obtained for each sample. Two 
regularizations can be considered statistically significantly different if 
the test has a p-value p < a = 10~ 3 . These results are presented in 
Table 2. First, we notice that all regularizations are statistically better 
than the absence of regularization. Then we notice that all spatial 
regularizations are statistically better than standard regularizations. Fi- 
nally, we notice that despite higher prediction accuracy, Elastic Net is 
not statistically significantly better than ridge in our tests. Similarly, 
fused-IASSO is not statistically significantly better than total variation 
in our tests. 

4.6. Computation time 

The various algorithms were implemented in a mix of C++, 
MATLAB®, mex and python. Table 3 reports approximate running time 
on a standard laptop (Intel® Core™ i7-2720QM CPU at 2.20 GHz, 8 GB 
of RAM). The geodesic shooting step is linear with the number of pa- 
tients. The computation of the template is linear with the number of pa- 
tients and the number of Karcher iterations. One should note that 
Karcher iterations can have decreasing number of gradient descent iter- 
ations, which decreases the total computation time. Then the transport 
is also linear with the number of patients. So far, it is interesting to no- 
tice that all the steps can be easily be divided into different jobs to take 
advantage of multi-core or distributed architectures. Finally come the 
learning and classification. The computation time of this step can vary 
dramatically depending on several parameters such as the training/test- 
ing splitting scheme, the optimization algorithm, and the number of 
regularization parameters to test. In particular, for this exploratory 
study, we used mainly HANSO algorithm, since the convergence rate 
of the proximal algorithms mentioned in Section 3.2.2 was too low. 
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Fig. 6. Effects of various regularizations on the solution w of the optimization problem. Each small image represents the coefficients of one 2D slice of w, which is a 3D volume. Zero 
coefficients are displayed in light green, higher values are going red and lower values are going blue. On each row, the regularization is increasing from left to right, and the 1 Oth and 
90th percentiles of the coefficients (resp. P 10 and P 90 ) correspond to the saturation limits of the colorbar. Panels a, b and c show standard regularizations whereas Panels d, e and f 
show spatial regularizations. Spatial regularizations provide more structured coefficients. 
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Table 1 

Prediction accuracy of MCI patients' progression. 



Regularization 


X 

range 


X 

(optimal /.) 


Spec+ 
Sens 


Spec 


Sens 


NPV 


PPV 


None 


0 


0 


1.00 


0.00 


1.00 


NaN 


0.18 


Standard 


LASSO 


[10- H ,10 U ] 


0.01 


1.04 


0.20 


0.84 


0.85 


0.19 


Ridge 


[10" B ,10 U ] 


0.001 


1.06 


0.95 


0.11 


0.82 


0.33 


Elastic Net 


[io- 9 ,io 0 ] 2 


r%i = o.oi 

I 5,2=1 


1.13 


0.29 


0.84 


0.89 


0.2 


Spatial 


Sobolev 


[10- 9 ,10 7 ] 


10 4 


1.17 


0.54 


0.63 


0.87 


0.24 


Total Variation 


[10" B ,10 U ] 


0.01 


1.31 


0.46 


0.84 


0.93 


0.26 


Fused LASSO 


[10- 9 ,10 0 ] 2 


r %i = o.oi 
1 5. 2 = io" 4 


1.32 


0.48 


0.84 


0.93 


0.27 



4.7. Comparison with the literature 

As mentioned earlier, the main contribution of this paper is the com- 
parison of the effects of various regularizations on the solution of binary 
classification problem with a logistic loss. In the context of longitudinal 
Alzheimer's disease study, we saw that the use of spatial regularizations 
techniques was not only leading to better classification results than 
standard regularizations, but also providing maps of coefficients with 
improved spatial coherence. 

In the literature, a large number of methods are also trying to iden- 
tify the hippocampal sub-areas that are related to either the conversion 
of patients to the disease or to other symptoms such as cognitive or 
memory measures. For example, one can cite Fig. 5 of Frisoni et al. 
(2008), Fig. 7 of Gutman et al. (2009), Fig. 1 to 5 of Apostolova et al. 
(2010), and Fig. 3 and 4 of Shen et al. (2012). 

Several strategies can be considered to compare the most signifi- 
cant regions found by various methods. One strategy is to transport 
relevance maps from different methods to the same space. However, 
transporting information is delicate (Fiot et al., 2012), and one 
needs to be cautious with such strategy. This transport could be 
avoided by using the same template for all methods, though this is 
likely to cause problems if the population studied is not the same. 
Another strategy is to rank the hippocampal subareas, as it is done 
for example in Table 2 of Frisoni et al. (2008), and compare the rank- 
ings. This strategy would require us to align a map of known hippo- 
campal subareas to our template, and design a ranking algorithm (for 
example based on J w(w) 2 d(u , where Cl R is a hippocampal 
subregion). 

Comparing qualitatively or quantitatively the subregions that are the 
most significant with regard to disease progression is out of the scope of 
this paper. Nonetheless, it is a very interesting perspective, and several 
strategies including the ones mentioned above are considered for future 
work. 



5. Conclusion 

In this paper, we studied deformation models for longitudinal popu- 
lation analysis, regularizations and machine learning strategies. In par- 
ticular, we investigated the combined use of the LDDMM framework 
and classification with logistic loss and spatial regularizations in the 
context of Alzheimer's disease. Results indicate that initial momenta of 
hippocampus deformations are able to capture information relevant to 
the progression of the disease. 

Another contribution of this paper is the joint use of a simple linear 
classifier with complex spatial regularizations. Achieving results higher 
than the ones reported in Fiot et al. (2012), which uses non-linear SVM 
classifier, our method provides in addition coefficient maps with direct 
anatomical interpretation. 

Moreover, we compared Sobolev, total variation and fused LASSO 
regularizations. While they all successfully enforce different priors 
(respectively smooth, piecewise constant and sparse), their resulting 
coefficient maps are coherent one to the other. They improve coefficient 
maps and their classification performances are statistically better than 
the ones obtained with standard regularizations. 

Now the ideas and results presented in this paper open a wide range 
of perspectives. First, the question of the representation of patients from 
images, and in particular the representation of their evolutions for lon- 
gitudinal population studies was raised. We have used initial momenta 
encoding the patient evolution in 3D volumes. An interesting research 
direction is the adaptation of our pipeline to surface representation of 
shape evolution. Indeed, as we saw in the application studied in this 
paper, the strong values of the initial momenta lie on the hippocampus 
volume boundary, in other words on the surface. Second, the question of 
how to compare evolutions of different patients was raised. We studied 
the use of Karcher mean and the importance of the regularizations. Even 
though diffeomorphic deformation models such as LDDMM can provide 
smooth deformation fields and encode the shape deformation of a 



Table 2 

Statistical p-values of two-sample t-tests between different regularizations. The variable considered is Spec + Sens, and 50 realizations of the variable from random re-orderings of the 
patients were obtained for each sample. Two regularizations can be considered statistically significantly different if the test has a p-value light green p < a = 10~ 3 (marked in green, 
red otherwise). 



Regularization 


Standard 


Spatial 


LASSO 


Ridge 


Elastic Net 


Sobolev 


Total Variation 


Fused LASSO 


None 


<10" 5 


<10" 5 


<10" 5 


<10" 5 


<10" 5 


<10" 5 


Standard 


LASSO 




iri0 -04 


<10" 5 


<10" 5 


<10" 5 


<10" 5 


Ridge 






4.2*1 0" 02 


<10" 5 


<10" 5 


<10" 5 


Elastic Net 








6.3*10-° 5 


<10" 5 


<10" 5 


Spatial 


Sobolev 










<10~ 5 


<10" 5 


Total Variation 












0.86 


Fused LASSO 
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Table 3 

Computation time of the various steps. ( ): can differ by several orders of magnitude, see 
Section 4.6 for details. 



Step 


Computation time 


Preprocessing 


A few hours 


Geodesic shooting 


«1 day 


Template computation 


w3 days 


Transport 


«1 day 


Learning and classification 


From 1 min to several days ' 



patient in a smooth representation, we saw that it is important to regu- 
larize spatially across the population (i.e. between patients) in order to 
be able to build meaningful statistical models for classification and bio- 
marker discovery. On that point, the logistic regression model has prov- 
en to be efficient as it can be combined with complex regularizations. 
Our spatial regularizations gave the best results on our dataset, and an- 
other research direction is the study of other regularizations such as 
group sparsity. Third, another great perspective of this work consists 
in studying evolutions of patients with more than two time points. In 
this context, the design of spatio-temporal regularizations (for example 
in the context of geodesic regression (Niethammer et al., 2011)) is an 
exciting research direction. 
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